Beta Distribution — Modeling Uncertain Probabilities#
The beta distribution is the workhorse distribution for random variables constrained to \([0,1]\). It is especially important for modeling uncertain probabilities (e.g., conversion rates, success probabilities, fractions).
What you’ll learn#
what the beta distribution models and why it’s useful
the PDF/CDF and how it relates to the beta function and gamma function
closed-form moments (mean/variance/skewness/kurtosis), MGF/CF, and entropy
how \((\alpha,\beta)\) control shape, mean, and concentration
a NumPy-only sampler (via Gamma sampling) + Monte Carlo validation
practical usage via scipy.stats.beta (pdf, cdf, rvs, fit)
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from scipy import stats, special
# Plotly rendering (CKC convention)
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
# Reproducibility
rng = np.random.default_rng(7)
np.set_printoptions(precision=4, suppress=True)
1) Title & Classification#
Name: Beta distribution
Type: Continuous
Support: \(x \in [0, 1]\) (density is defined for \(0 < x < 1\))
Parameters: shape parameters \(\alpha > 0\), \(\beta > 0\)
We write:
$$X \sim \mathrm{Beta}(\alpha, \beta).$$
2) Intuition & Motivation#
2.1 What it models#
The beta distribution models a random proportion or random probability. If you know a quantity must live in \([0,1]\) (a fraction, rate, probability, mixture weight), the beta distribution is often the first candidate.
2.2 Real-world use cases#
Conversion rate / click-through rate: unknown probability of success
A/B testing: uncertainty over two proportions
Reliability / defect rates: fraction of items that fail
Normalized quantities: e.g., fraction of budget, fraction of time spent
Random mixing weights: blend two components with a random weight \(w \in [0,1]\)
2.3 Relations to other distributions#
Conjugate prior for Bernoulli/Binomial likelihoods (Beta–Bernoulli / Beta–Binomial).
Special cases:
\(\mathrm{Beta}(1,1)\) is Uniform on \([0,1]\).
\(\mathrm{Beta}(\tfrac{1}{2},\tfrac{1}{2})\) is the arcsine distribution.
Order statistics: if \(U_{(k)}\) is the \(k\)-th order statistic of \(n\) iid Uniform\((0,1)\) samples, then $$U_{(k)} \sim \mathrm{Beta}(k,\, n+1-k).$$
Gamma connection: if \(G_1 \sim \mathrm{Gamma}(\alpha,1)\) and \(G_2 \sim \mathrm{Gamma}(\beta,1)\) independently, then $$\frac{G_1}{G_1 + G_2} \sim \mathrm{Beta}(\alpha,\beta).$$ (Both relations are spot-checked numerically below.)
Dirichlet: the beta distribution is the two-component (\(K=2\)) special case of the Dirichlet distribution.
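The order-statistics and Gamma-ratio relations are easy to spot-check by simulation. Below is a minimal sketch, assuming arbitrary choices of \(k=3\), \(n=10\) and \((\alpha,\beta)=(2,5)\); each derived sample is compared against the claimed beta law with a KS test.
# Spot-check: order statistics of uniforms, and the Gamma ratio identity
# (separate generator so the notebook's main rng stream is untouched)
rng_demo = np.random.default_rng(123)
k_th, n_unif = 3, 10
u_sorted = np.sort(rng_demo.random((50_000, n_unif)), axis=1)
kth_stat = u_sorted[:, k_th - 1]  # k-th smallest of n_unif uniforms
ks_order = stats.kstest(kth_stat, stats.beta(k_th, n_unif + 1 - k_th).cdf)
g1 = rng_demo.gamma(2.0, size=50_000)  # Gamma(shape=2, scale=1)
g2 = rng_demo.gamma(5.0, size=50_000)  # Gamma(shape=5, scale=1)
ks_gamma = stats.kstest(g1 / (g1 + g2), stats.beta(2.0, 5.0).cdf)
ks_order.pvalue, ks_gamma.pvalue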
3) Formal Definition#
3.1 Beta function#
The beta function is
$$B(\alpha,\beta) = \int_0^1 t^{\alpha-1}(1-t)^{\beta-1}\,dt = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}.$$
3.2 PDF#
For \(0 < x < 1\):
$$f(x \mid \alpha,\beta) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)}.$$
3.3 CDF#
The CDF can be written using the regularized incomplete beta function \(I_x(\alpha,\beta)\):
$$F(x \mid \alpha,\beta) = I_x(\alpha,\beta) = \frac{1}{B(\alpha,\beta)} \int_0^x t^{\alpha-1}(1-t)^{\beta-1}\,dt.$$
In SciPy, the CDF is computed via this special function.
def beta_pdf(x: np.ndarray, alpha: float, beta: float) -> np.ndarray:
'''Numerically stable beta PDF via log-space (uses SciPy special functions).'''
x = np.asarray(x, dtype=float)
log_pdf = (
(alpha - 1) * np.log(x)
+ (beta - 1) * np.log1p(-x)
- special.betaln(alpha, beta)
)
return np.exp(log_pdf)
def beta_cdf(x: np.ndarray, alpha: float, beta: float) -> np.ndarray:
'''Beta CDF via regularized incomplete beta I_x(alpha, beta).'''
x = np.asarray(x, dtype=float)
return special.betainc(alpha, beta, x)
# Quick sanity check: PDF integrates to ~1
alpha0, beta0 = 2.0, 5.0
xgrid = np.linspace(1e-6, 1 - 1e-6, 200_000)
area = np.trapz(beta_pdf(xgrid, alpha0, beta0), xgrid)  # on NumPy >= 2.0 use np.trapezoid (trapz was removed)
area
0.9999999999225003
4) Moments & Properties#
4.1 Mean, variance, skewness, kurtosis#
For \(X \sim \mathrm{Beta}(\alpha,\beta)\):
Mean $$\mathbb{E}[X] = \frac{\alpha}{\alpha+\beta}.$$
Variance $$\mathrm{Var}(X) = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}.$$
Skewness $$\gamma_1 = \frac{2(\beta-\alpha)\sqrt{\alpha+\beta+1}}{(\alpha+\beta+2)\sqrt{\alpha\beta}}.$$
Excess kurtosis (kurtosis minus 3) $$\gamma_2 = \frac{6\big[(\alpha-\beta)^2(\alpha+\beta+1) - \alpha\beta(\alpha+\beta+2)\big]}{\alpha\beta(\alpha+\beta+2)(\alpha+\beta+3)}.$$
A useful compact identity for raw moments is:
$$\mathbb{E}[X^k] = \frac{(\alpha)_k}{(\alpha+\beta)_k} = \prod_{i=0}^{k-1} \frac{\alpha+i}{\alpha+\beta+i},$$
where \((a)_k = a(a+1)\cdots(a+k-1)\) is the rising factorial (Pochhammer symbol).
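This identity is easy to check numerically. The sketch below uses scipy.special.poch for the rising factorial and a fresh generator (seed chosen arbitrarily) so the notebook's main rng stream is untouched.
# Spot-check E[X^k] = (alpha)_k / (alpha+beta)_k for the running example
rng_check = np.random.default_rng(0)
xs = stats.beta(alpha0, beta0).rvs(size=100_000, random_state=rng_check)
for k in (1, 2, 3):
    theory = special.poch(alpha0, k) / special.poch(alpha0 + beta0, k)
    print(f"k={k}: theory={theory:.5f}, monte carlo={np.mean(xs**k):.5f}")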
4.2 Mode#
If \(\alpha > 1\) and \(\beta > 1\) (unimodal interior case),
$$\mathrm{mode}(X) = \frac{\alpha - 1}{\alpha + \beta - 2}.$$
4.3 MGF and characteristic function#
Because the support is bounded, the MGF exists for all real \(t\):
$$M_X(t) = \mathbb{E}\big[e^{tX}\big] = {}_1F_1(\alpha;\, \alpha+\beta;\, t),$$
where \({}_1F_1\) is the confluent hypergeometric function. The characteristic function is
$$\varphi_X(t) = {}_1F_1(\alpha;\, \alpha+\beta;\, it).$$
4.4 Differential entropy#
The differential entropy is
$$h(X) = \ln B(\alpha,\beta) - (\alpha-1)\psi(\alpha) - (\beta-1)\psi(\beta) + (\alpha+\beta-2)\psi(\alpha+\beta),$$
where \(\psi\) is the digamma function.
def beta_moments(alpha: float, beta: float) -> dict:
a, b = float(alpha), float(beta)
mean = a / (a + b)
var = a * b / ((a + b) ** 2 * (a + b + 1))
skew = 2 * (b - a) * np.sqrt(a + b + 1) / ((a + b + 2) * np.sqrt(a * b))
excess_kurt = (
6
* (((a - b) ** 2) * (a + b + 1) - a * b * (a + b + 2))
/ (a * b * (a + b + 2) * (a + b + 3))
)
mode = np.nan
if a > 1 and b > 1:
mode = (a - 1) / (a + b - 2)
mgf = lambda t: special.hyp1f1(a, a + b, t)
entropy = (
special.betaln(a, b)
- (a - 1) * special.digamma(a)
- (b - 1) * special.digamma(b)
+ (a + b - 2) * special.digamma(a + b)
)
return {
"mean": mean,
"var": var,
"skew": skew,
"excess_kurtosis": excess_kurt,
"mode": mode,
"entropy": entropy,
"mgf": mgf,
}
m = beta_moments(alpha0, beta0)
{k: v for k, v in m.items() if k != "mgf"}
{'mean': 0.2857142857142857,
'var': 0.025510204081632654,
'skew': 0.5962847939999439,
'excess_kurtosis': -0.12,
'mode': 0.2,
'entropy': -0.48453071499548805}
# Monte Carlo check of mean/variance + MGF at a few t
n = 200_000
samples_scipy = stats.beta(alpha0, beta0).rvs(size=n, random_state=rng)
mc_mean = samples_scipy.mean()
mc_var = samples_scipy.var(ddof=0)
mc_mgf_1 = np.mean(np.exp(1.0 * samples_scipy))
mc_mgf_m1 = np.mean(np.exp(-1.0 * samples_scipy))
(
m["mean"],
mc_mean,
m["var"],
mc_var,
m["mgf"](1.0),
mc_mgf_1,
m["mgf"](-1.0),
mc_mgf_m1,
)
(0.2857142857142857,
0.2850010733170977,
0.025510204081632654,
0.02540504391457564,
1.3483340379497217,
1.3473002240829173,
0.7608141393691706,
0.7613179409507954)
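The closed-form entropy above can also be cross-checked against SciPy's built-in entropy method for the frozen distribution; the two values should agree up to floating-point error.
# Cross-check the closed-form differential entropy against SciPy
m["entropy"], stats.beta(alpha0, beta0).entropy()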
5) Parameter Interpretation#
The parameters \((\alpha,\beta)\) control both where the mass is (mean) and how concentrated it is.
5.1 Mean–concentration reparameterization#
A common and very interpretable reparameterization uses:
mean \(m \in (0,1)\)
concentration \(\kappa > 0\)
with
$$\alpha = m\kappa, \qquad \beta = (1-m)\kappa.$$
Increasing \(\kappa\) (holding \(m\) fixed) makes the distribution tighter around \(m\).
Moving \(m\) (holding \(\kappa\) fixed) shifts the mass left/right.
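A small convenience helper for this reparameterization (a sketch; the function name beta_from_mean_concentration is ours, not a SciPy API):
def beta_from_mean_concentration(m: float, kappa: float) -> tuple[float, float]:
    '''Convert (mean m, concentration kappa) to shape parameters (alpha, beta).'''
    if not (0.0 < m < 1.0) or kappa <= 0.0:
        raise ValueError("need 0 < m < 1 and kappa > 0")
    return m * kappa, (1.0 - m) * kappa

a_mk, b_mk = beta_from_mean_concentration(0.3, 20.0)
a_mk, b_mk, stats.beta(a_mk, b_mk).mean(), stats.beta(a_mk, b_mk).std()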
5.2 Shape regimes#
\(\alpha = \beta = 1\): uniform
\(\alpha, \beta > 1\): unimodal with an interior mode; the density vanishes at the endpoints
\(\alpha < 1\) or \(\beta < 1\): density diverges near 0 and/or 1 (still integrable)
\(\alpha = \beta\): symmetric around 0.5
x = np.linspace(1e-4, 1 - 1e-4, 600)
param_sets = [
(0.5, 0.5, "U-shaped (0.5,0.5)"),
(1.0, 1.0, "Uniform (1,1)"),
(2.0, 2.0, "Symmetric peak (2,2)"),
(2.0, 5.0, "Skewed right (2,5)"),
(5.0, 2.0, "Skewed left (5,2)"),
(8.0, 1.5, "Mass near 1 (8,1.5)"),
]
fig = go.Figure()
for a, b, label in param_sets:
fig.add_trace(go.Scatter(x=x, y=beta_pdf(x, a, b), mode="lines", name=label))
fig.update_layout(
title="Beta PDF for different (α, β)",
xaxis_title="x",
yaxis_title="density",
width=900,
height=450,
)
fig
# Same mean, different concentration κ
m_fixed = 0.3
kappas = [2, 5, 20, 100]
fig = go.Figure()
for kappa in kappas:
a = kappa * m_fixed
b = kappa * (1 - m_fixed)
fig.add_trace(go.Scatter(x=x, y=beta_pdf(x, a, b), mode="lines", name=f"κ={kappa}"))
fig.update_layout(
title="Same mean (m=0.3), increasing concentration κ",
xaxis_title="x",
yaxis_title="density",
width=900,
height=420,
)
fig
6) Derivations#
6.1 Expectation#
Starting from the definition:
\begin{align}
\mathbb{E}[X]
&= \int_0^1 x\, f(x \mid \alpha,\beta)\, dx \\
&= \frac{1}{B(\alpha,\beta)} \int_0^1 x^{\alpha}(1-x)^{\beta-1}\, dx \\
&= \frac{B(\alpha+1,\beta)}{B(\alpha,\beta)}.
\end{align}
Using \(B(\alpha,\beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}\):
\begin{align}
\frac{B(\alpha+1,\beta)}{B(\alpha,\beta)}
&= \frac{\Gamma(\alpha+1)\Gamma(\beta)\, \Gamma(\alpha+\beta)}{\Gamma(\alpha+\beta+1)\, \Gamma(\alpha)\Gamma(\beta)} \\
&= \frac{\alpha\, \Gamma(\alpha)\, \Gamma(\alpha+\beta)}{(\alpha+\beta)\, \Gamma(\alpha+\beta)\, \Gamma(\alpha)} \\
&= \frac{\alpha}{\alpha+\beta}.
\end{align}
6.2 Variance#
Compute \(\mathbb{E}[X^2]\) similarly:
\begin{align}
\mathbb{E}[X^2]
&= \frac{1}{B(\alpha,\beta)} \int_0^1 x^{\alpha+1}(1-x)^{\beta-1}\, dx \\
&= \frac{B(\alpha+2,\beta)}{B(\alpha,\beta)}
= \frac{\alpha(\alpha+1)}{(\alpha+\beta)(\alpha+\beta+1)}.
\end{align}
Then
$$\mathrm{Var}(X) = \mathbb{E}[X^2] - \big(\mathbb{E}[X]\big)^2 = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}.$$
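As a numerical sanity check on this step, the sketch below integrates \(x^2 f(x)\) directly with scipy.integrate.quad (imported locally here) and compares the result to the closed form for \((\alpha,\beta)=(2,5)\).
from scipy.integrate import quad

# E[X^2] by direct integration vs. the closed form above
a_chk, b_chk = 2.0, 5.0
ex2_numeric, _ = quad(lambda t: t**2 * beta_pdf(t, a_chk, b_chk), 0.0, 1.0)
ex2_formula = a_chk * (a_chk + 1) / ((a_chk + b_chk) * (a_chk + b_chk + 1))
var_numeric = ex2_numeric - (a_chk / (a_chk + b_chk)) ** 2
ex2_numeric, ex2_formula, var_numeric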
6.3 Likelihood (iid sample)#
For data \(x_1,\ldots,x_n \in (0,1)\) iid from \(\mathrm{Beta}(\alpha,\beta)\):
\begin{align}
L(\alpha,\beta) &= \prod_{i=1}^n \frac{1}{B(\alpha,\beta)}\, x_i^{\alpha-1}(1-x_i)^{\beta-1}, \\
\ell(\alpha,\beta) &= \log L(\alpha,\beta) \\
&= -n \log B(\alpha,\beta) + (\alpha-1)\sum_{i=1}^n \log x_i + (\beta-1)\sum_{i=1}^n \log(1-x_i).
\end{align}
Maximizing \(\ell(\alpha,\beta)\) (MLE) has no closed-form solution in general; it’s typically solved numerically.
SciPy’s beta.fit does this for you.
def beta_loglikelihood(x: np.ndarray, alpha: float, beta: float) -> float:
x = np.asarray(x, dtype=float)
if np.any((x <= 0) | (x >= 1)):
return -np.inf
n = x.size
return (
-n * special.betaln(alpha, beta)
+ (alpha - 1) * np.sum(np.log(x))
+ (beta - 1) * np.sum(np.log1p(-x))
)
# Example log-likelihood value
beta_loglikelihood(samples_scipy[:1000], alpha0, beta0)
493.07429490684376
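To make “solved numerically” concrete, here is a minimal sketch that maximizes the log-likelihood above with scipy.optimize.minimize and compares the result to SciPy’s built-in fit; the starting point and the Nelder–Mead method are arbitrary choices.
from scipy import optimize

def neg_loglik(params, x):
    a, b = params
    if a <= 0 or b <= 0:
        return np.inf  # keep the optimizer inside the valid parameter region
    return -beta_loglikelihood(x, a, b)

x_fit = samples_scipy[:1000]
res = optimize.minimize(neg_loglik, x0=[1.0, 1.0], args=(x_fit,), method="Nelder-Mead")
a_ref, b_ref, _, _ = stats.beta.fit(x_fit, floc=0, fscale=1)
res.x, (a_ref, b_ref)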
7) Sampling & Simulation#
A numerically convenient way to sample from a beta distribution uses the Gamma ratio identity:
Sample \(G_1 \sim \mathrm{Gamma}(\alpha, 1)\)
Sample \(G_2 \sim \mathrm{Gamma}(\beta, 1)\)
Return $$X = \frac{G_1}{G_1 + G_2}.$$
So the core task is: sample Gamma(shape, scale=1) using only NumPy.
7.1 Gamma sampling: Marsaglia–Tsang#
For shape \(k \ge 1\), Marsaglia & Tsang (2000) propose an efficient acceptance–rejection method. For \(k < 1\), we can use the reduction: if \(G \sim \mathrm{Gamma}(k+1, 1)\) and \(U \sim \mathrm{Uniform}(0,1)\) are independent, then \(G \cdot U^{1/k} \sim \mathrm{Gamma}(k, 1)\).
We implement this below.
def gamma_rvs_numpy(shape: float, size: int, rng: np.random.Generator) -> np.ndarray:
'''Sample Gamma(shape, scale=1) using NumPy only (Marsaglia-Tsang).
Parameters
----------
shape:
k > 0
size:
number of samples
rng:
NumPy Generator
'''
k = float(shape)
if k <= 0:
raise ValueError("shape must be > 0")
# k < 1: boost to k+1 and apply power transform
if k < 1:
g = gamma_rvs_numpy(k + 1.0, size, rng)
u = rng.random(size)
return g * (u ** (1.0 / k))
# k >= 1: Marsaglia-Tsang
d = k - 1.0 / 3.0
c = 1.0 / np.sqrt(9.0 * d)
out = np.empty(size, dtype=float)
filled = 0
while filled < size:
n = size - filled
x = rng.standard_normal(n)
v = (1.0 + c * x)
v = v * v * v # (1 + c x)^3
u = rng.random(n)
positive = v > 0
# First (cheap) acceptance
accept = positive & (u < 1.0 - 0.0331 * (x**4))
# Second acceptance (log test) - compute log(v) only where v > 0 to avoid warnings
log_v = np.zeros_like(v)
log_v[positive] = np.log(v[positive])
accept2 = positive & (~accept) & (
np.log(u) < 0.5 * x * x + d * (1.0 - v + log_v)
)
accept = accept | accept2
accepted = d * v[accept]
take = min(accepted.size, n)
out[filled : filled + take] = accepted[:take]
filled += take
return out
def beta_rvs_numpy(alpha: float, beta: float, size: int, rng: np.random.Generator) -> np.ndarray:
'''Sample Beta(alpha, beta) using Gamma ratio with NumPy-only Gamma sampler.'''
g1 = gamma_rvs_numpy(alpha, size, rng)
g2 = gamma_rvs_numpy(beta, size, rng)
return g1 / (g1 + g2)
# Monte Carlo validation against theory
n = 200_000
samples_numpy = beta_rvs_numpy(alpha0, beta0, n, rng)
np.mean(samples_numpy), np.var(samples_numpy), m["mean"], m["var"]
(0.2857694499926869,
0.025682506385920862,
0.2857142857142857,
0.025510204081632654)
# Compare NumPy-only sampler to SciPy sampler (quick KS test)
ks = stats.ks_2samp(samples_numpy[:20_000], samples_scipy[:20_000])
ks
KstestResult(statistic=0.00824999999999998, pvalue=0.5014359645922872, statistic_location=0.2744678864589129, statistic_sign=-1)
8) Visualization#
We’ll visualize:
the theoretical PDF and CDF
Monte Carlo samples from our NumPy-only sampler
# PDF + histogram (Monte Carlo)
x = np.linspace(1e-4, 1 - 1e-4, 800)
fig = go.Figure()
fig.add_trace(
go.Histogram(
x=samples_numpy,
nbinsx=60,
histnorm="probability density",
name="Monte Carlo (NumPy-only)",
opacity=0.55,
)
)
fig.add_trace(
go.Scatter(
x=x,
y=stats.beta(alpha0, beta0).pdf(x),
mode="lines",
name="True PDF (SciPy)",
line=dict(width=3),
)
)
fig.update_layout(
title=f"Beta({alpha0}, {beta0}): histogram vs PDF",
xaxis_title="x",
yaxis_title="density",
width=900,
height=420,
)
fig
# CDF: theoretical vs empirical
x = np.linspace(0, 1, 600)
emp_x = np.sort(samples_numpy)
emp_cdf = np.arange(1, emp_x.size + 1) / emp_x.size
fig = go.Figure()
fig.add_trace(go.Scatter(x=x, y=stats.beta(alpha0, beta0).cdf(x), mode="lines", name="True CDF"))
fig.add_trace(
go.Scatter(
x=emp_x[::200],
y=emp_cdf[::200],
mode="markers",
name="Empirical CDF (subsampled)",
marker=dict(size=4, opacity=0.6),
)
)
fig.update_layout(
title=f"Beta({alpha0}, {beta0}): theoretical CDF vs empirical CDF",
xaxis_title="x",
yaxis_title="CDF",
width=900,
height=420,
)
fig
9) SciPy Integration (scipy.stats.beta)#
SciPy parameterization:
stats.beta(a=alpha, b=beta, loc=0, scale=1)
a, b are the shape parameters \(\alpha\), \(\beta\). loc and scale let you shift/scale the support from \([0,1]\) to \([\mathrm{loc},\, \mathrm{loc} + \mathrm{scale}]\).
dist = stats.beta(alpha0, beta0) # loc=0, scale=1 by default
x = np.linspace(0, 1, 6)
pdf = dist.pdf(x)
cdf = dist.cdf(x)
samples = dist.rvs(size=5, random_state=rng)
pdf, cdf, samples
(array([0. , 2.4576, 1.5552, 0.4608, 0.0384, 0. ]),
array([0. , 0.3446, 0.7667, 0.959 , 0.9984, 1. ]),
array([0.3445, 0.4805, 0.0498, 0.2403, 0.4281]))
# Fitting (MLE) with SciPy
# If you KNOW the data live on [0, 1], it's common to fix loc=0 and scale=1.
a_hat, b_hat, loc_hat, scale_hat = stats.beta.fit(samples_numpy[:10_000], floc=0, fscale=1)
a_hat, b_hat, loc_hat, scale_hat
(1.9849242586340088, 4.951405202339398, 0, 1)
10) Statistical Use Cases#
10.1 Hypothesis testing / confidence intervals for proportions#
A classic exact confidence interval for a Binomial proportion (Clopper–Pearson) uses Beta quantiles.
If \(K \sim \mathrm{Binomial}(n, p)\) and you observe \(k\) successes, a two-sided \((1-\alpha)\) interval is (here \(\alpha\) is the significance level, not the shape parameter):
lower endpoint: \(\mathrm{Beta}^{-1}(\alpha/2;\; k,\; n-k+1)\)
upper endpoint: \(\mathrm{Beta}^{-1}(1-\alpha/2;\; k+1,\; n-k)\)
(where \(\mathrm{Beta}^{-1}\) is the inverse CDF / quantile function).
10.2 Bayesian modeling (Beta–Bernoulli / Beta–Binomial)#
Let \(p\) be an unknown success probability.
Prior: \(p \sim \mathrm{Beta}(\alpha_0, \beta_0)\)
Data: \(k\) successes and \(n-k\) failures
Posterior is conjugate:
$$p \mid \text{data} \sim \mathrm{Beta}(\alpha_0 + k,\; \beta_0 + n - k).$$
This is often interpreted as adding pseudo-counts to observed counts.
10.3 Generative modeling#
Sample random probabilities \(p\) for Bernoulli events.
Sample mixing weights \(w\) for two-component mixtures.
Generalization: Dirichlet for \(K\)-way mixture weights. (All three are sketched in the next cell.)
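A small generative sketch of these three bullets; the parameter values are arbitrary, and a separate generator keeps the notebook's main rng stream untouched.
# Generative sketch: random probability -> Bernoulli draws; Dirichlet for K-way weights
rng_gen = np.random.default_rng(42)
p_random = rng_gen.beta(2.0, 5.0)             # one random success probability
coin_flips = rng_gen.random(10) < p_random    # ten Bernoulli(p_random) outcomes
w = rng_gen.dirichlet([2.0, 5.0])             # K=2 Dirichlet; w[0] ~ Beta(2, 5)
p_random, coin_flips.astype(int), w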
# Example: Clopper–Pearson interval
n = 100
k = 37
alpha_level = 0.05
if k == 0:
cp_low = 0.0
else:
cp_low = stats.beta.ppf(alpha_level / 2, k, n - k + 1)
if k == n:
cp_high = 1.0
else:
cp_high = stats.beta.ppf(1 - alpha_level / 2, k + 1, n - k)
(cp_low, cp_high)
(0.2755665796145515, 0.47235164055168316)
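As a cross-check, recent SciPy versions (1.7+) expose the same exact interval through stats.binomtest; this sketch assumes such a version is installed.
# Cross-check against SciPy's built-in exact (Clopper–Pearson) interval
ci = stats.binomtest(k, n).proportion_ci(confidence_level=1 - alpha_level, method="exact")
ci.low, ci.high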
# Example: Bayesian update for a Bernoulli probability
alpha_prior, beta_prior = 2.0, 2.0
alpha_post = alpha_prior + k
beta_post = beta_prior + (n - k)
prior = stats.beta(alpha_prior, beta_prior)
post = stats.beta(alpha_post, beta_post)
x = np.linspace(1e-4, 1 - 1e-4, 600)
fig = go.Figure()
fig.add_trace(go.Scatter(x=x, y=prior.pdf(x), mode="lines", name=f"Prior Beta({alpha_prior:.0f},{beta_prior:.0f})"))
fig.add_trace(go.Scatter(x=x, y=post.pdf(x), mode="lines", name=f"Posterior Beta({alpha_post:.0f},{beta_post:.0f})", line=dict(width=3)))
fig.update_layout(
title=f"Bayesian update (n={n}, k={k})",
xaxis_title="p",
yaxis_title="density",
width=900,
height=420,
)
fig
# Posterior probability of beating a threshold (Bayesian hypothesis-style query)
threshold = 0.4
post_prob = 1 - post.cdf(threshold)
post_mean = post.mean()
post_prob, post_mean
(0.29530027558732863, 0.375)
11) Pitfalls#
Invalid parameters: \(\alpha \le 0\) or \(\beta \le 0\) is not a valid beta distribution.
Boundary data (0 or 1):
The continuous Beta model assigns probability 0 to exact endpoints.
Log-likelihood involves \(\log x\) and \(\log(1-x)\), which become \(-\infty\) at 0 or 1.
If your data contain many exact 0/1 values, consider a zero/one-inflated beta model or add a measurement model (see the sketch after this list).
Numerical issues near 0 or 1:
For \(\alpha < 1\) or \(\beta < 1\), the density can diverge near the boundaries; work in log-space (betaln, logpdf).
When plotting, avoid evaluating exactly at 0 or 1; use small epsilons.
Fitting:
stats.beta.fit will estimate loc and scale unless they are fixed.
If data are already in \([0,1]\), using floc=0, fscale=1 avoids spurious shifts/scales.
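A tiny sketch of the boundary-data pitfall and the common clamping workaround (the epsilon is an arbitrary choice, and clamping is a pragmatic fix, not a substitute for a proper zero/one-inflated model):
# Exact 0/1 values send the log-likelihood to -inf; clamping restores a finite value
x_bad = np.array([0.0, 0.2, 0.5, 0.9, 1.0])
ll_bad = beta_loglikelihood(x_bad, 2.0, 5.0)  # -inf because of the exact 0 and 1
eps = 1e-6
x_clamped = np.clip(x_bad, eps, 1 - eps)
ll_clamped = beta_loglikelihood(x_clamped, 2.0, 5.0)
ll_bad, ll_clamped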
12) Summary#
The beta distribution is a flexible family on \([0,1]\), ideal for modeling uncertain probabilities.
\((\alpha,\beta)\) control shape; \(m = \alpha/(\alpha+\beta)\) is the mean and \(\kappa = \alpha+\beta\) acts like a concentration.
Key formulas (mean/variance/skewness/kurtosis) are closed-form; the CDF uses the regularized incomplete beta function.
Sampling is easy via the Gamma ratio identity; SciPy provides a robust reference implementation (scipy.stats.beta).
References
Marsaglia, G. & Tsang, W. W. (2000). “A Simple Method for Generating Gamma Variables.” ACM Transactions on Mathematical Software, 26(3), 363–372.
SciPy documentation: scipy.stats.beta, scipy.special.betainc, scipy.special.hyp1f1.